In Unicode, a script is a collection of letters and other written signs used to represent textual information in one or more writing systems.[1] Some scripts support one and only one writing system and language, for example, Armenian. Other scripts support many different writing systems. For example, the Latin script supports English, French, German, Italian, Vietnamese and Latin. Some languages have used more than one writing system, and therefore more than one script. Turkish, for example, was written in the Arabic script before the 20th century but transitioned to the Latin script in the early part of that century. For a list of languages supported by each script see the list of languages by writing system.
Unicode symbols complement the scripts: together, scripts and symbols cover all Unicode characters. The unified diacritical characters and unified punctuation characters frequently have the “common” or “inherited” script property. However, individual scripts often have their own punctuation and diacritics, so many scripts include not only letters but also diacritics and other marks, punctuation, numerals, and even their own idiosyncratic symbols and space characters.
Unicode 6.0 includes 26 ancient and historic scripts and 67 modern scripts. Unicode is actively working on many more as indicated by its roadmap.
When multiple languages make use of the same script, there are frequently some differences, particularly in diacritics and other marks. For example, Swedish and English both use the Latin script. However, Swedish includes the character “å” (sometimes called a “Swedish O”) while English has no such character. Nor does English use the combining ring above diacritic for any character. In general, languages sharing the same script share many of the same characters. Despite these peripheral differences between the Swedish and English writing systems, they are said to use the same Latin script. The Unicode abstraction of scripts is therefore a basic organizing technique. The differences between alphabets or writing systems remain, and are supported through Unicode's flexible scripts, combining marks and collation algorithms.
Unicode can assign each character in the UCS to only one script. However, many characters (those that are not part of a formal natural-language writing system, or that are unified across many writing systems) may be used in more than one script. Examples are currency signs, symbols, numerals and punctuation marks. In these cases Unicode defines them as belonging to the common script (ISO 15924 code "Zyyy"). In all, Unicode has 6379 characters defined as "Common" script.
In addition, many diacritics and non-spacing combining characters may be applied to characters from more than one script. In these cases Unicode assigns them to the inherited script (ISO 15924 code "Zinh"), which means that they have the same script class as the base character with which they combine, and so in different contexts they may be treated as belonging to different scripts. For example, U+0308 ◌̈ combining diaeresis may combine with either U+0065 e latin small letter e to create a Latin "ë", or with U+0435 е cyrillic small letter ie for the Cyrillic "ё". In the former case it inherits the Latin script of the base character, whereas in the latter case it inherits the Cyrillic script of the base character. 523 characters in Unicode are of the inherited script.
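The inheritance rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Unicode's own algorithm: the `SCRIPT_OF` table and the `resolve_scripts` function are hypothetical names, and the table is hand-written for just the three characters in the example (a real implementation would read the Script property from the Unicode Character Database). Only the `unicodedata` calls at the end come from the standard library.

```python
import unicodedata

# Hypothetical, hand-written script table covering only the characters used
# below. A real implementation would load the Unicode Script property data.
SCRIPT_OF = {
    "\u0065": "Latin",      # LATIN SMALL LETTER E
    "\u0435": "Cyrillic",   # CYRILLIC SMALL LETTER IE
    "\u0308": "Inherited",  # COMBINING DIAERESIS
}


def resolve_scripts(text):
    """Return (char, script) pairs; Inherited takes the preceding base's script."""
    resolved = []
    base_script = "Unknown"
    for ch in text:
        script = SCRIPT_OF.get(ch, "Unknown")
        if script == "Inherited":
            script = base_script      # inherit from the base character
        else:
            base_script = script      # remember the most recent base
        resolved.append((ch, script))
    return resolved


latin = resolve_scripts("e\u0308")
cyrillic = resolve_scripts("\u0435\u0308")
print(latin[1][1])     # script the diaeresis takes after a Latin base
print(cyrillic[1][1])  # script the diaeresis takes after a Cyrillic base

# The standard library confirms that U+0308 is a combining mark and that the
# two sequences compose to different precomposed letters under NFC.
print(unicodedata.combining("\u0308") > 0)
print(unicodedata.name(unicodedata.normalize("NFC", "e\u0308")))
print(unicodedata.name(unicodedata.normalize("NFC", "\u0435\u0308")))
```

The same combining mark therefore ends up tagged as Latin in one sequence and Cyrillic in the other, which is the behaviour the inherited script is meant to capture.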
Ancient and historic scripts in Unicode[1] |
---|
^ Unicode. As of version 5.2 (Brāhmī: 6.0) |
Unicode includes 25 ancient scripts (out of use for a thousand years or more) and historic scripts (out of use for several hundred years).[2]
See also: phonemic and phonetic orthography.
"Writing system" is sometimes treated as a synonym for script. However, it can also refer to the specific concrete writing system supported by a script. For example, the Vietnamese writing system is supported by the Latin script. A writing system may also cover more than one script; for example, the Japanese writing system makes use of the Han, Hiragana and Katakana scripts.
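The relationship between a writing system and its scripts can be illustrated with a rough Python sketch. Python's standard `unicodedata` module does not expose the Script property, so the code below assumes that the first word of a character's Unicode name hints at its script. That is a heuristic for illustration only; `NAME_PREFIX_TO_SCRIPT` and `scripts_used` are hypothetical names, and the mapping covers only the scripts mentioned here.

```python
import unicodedata

# Assumption: the first word of the character name hints at the script.
# This is NOT the Unicode Script property, only a rough stand-in for it.
NAME_PREFIX_TO_SCRIPT = {
    "CJK": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "LATIN": "Latin",
}


def scripts_used(text):
    """Return the set of (heuristically detected) scripts present in text."""
    found = set()
    for ch in text:
        name = unicodedata.name(ch, "")          # "" for unnamed characters
        prefix = name.split(" ")[0] if name else ""
        script = NAME_PREFIX_TO_SCRIPT.get(prefix)
        if script is not None:
            found.add(script)
    return found


# A Japanese phrase mixing ideographs, Hiragana and Katakana.
print(sorted(scripts_used("\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8")))
# A Vietnamese phrase, written entirely in the Latin script.
print(sorted(scripts_used("Ti\u1ebfng Vi\u1ec7t")))
```

Spaces and other characters whose names do not start with a listed prefix are simply ignored by this sketch.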
Most writing systems can be broadly divided into several categories: logographic, syllabic, alphabetic (or segmental), abugida, abjad and featural; however, all features of any of these may be found in any given writing system in varying proportions, often making it difficult to purely categorize a system. The term complex system is sometimes used to describe those where the admixture makes classification problematic.
Unicode supports all of these types of writing systems through its numerous scripts. Unicode also adds further properties to characters to help differentiate the various characters and the ways they behave within Unicode text processing algorithms.
Unicode provides a general category property for each character, so in addition to belonging to a script, every character also has a general category. Scripts typically include letter characters: uppercase letters, lowercase letters and modifier letters. A few precomposed ligatures, such as ǲ (U+01F2), are categorized as titlecase letters. Such titlecase ligatures are all in the Latin and Greek scripts and are all compatibility characters, so Unicode discourages their use by authors. It is unlikely that new titlecase letters will be added in the future.
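The general category can be inspected directly with Python's standard `unicodedata.category` function, which returns the two-letter category abbreviation. The sample characters below are chosen for illustration; the `samples` dictionary is not part of any standard.

```python
import unicodedata

# One sample character per letter category discussed above.
samples = {
    "A": "uppercase letter",
    "a": "lowercase letter",
    "\u01f2": "titlecase letter",  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z
    "\u02b0": "modifier letter",   # MODIFIER LETTER SMALL H
}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {label}")

# The titlecase ligature sits between a fully uppercase and a fully
# lowercase form of the same digraph.
titlecase = "\u01f2"
print(titlecase.upper() == "\u01f1")  # U+01F1, the all-uppercase form
print(titlecase.lower() == "\u01f3")  # U+01F3, the all-lowercase form
```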
Most writing systems do not differentiate between uppercase and lowercase letters. For those scripts, all letters are categorized as “other letter” or “modifier letter”. Ideographs such as Unihan ideographs are also categorized as “other letter”. A few scripts do differentiate between uppercase and lowercase, however: Latin, Cyrillic, Greek, Armenian, Georgian, and Deseret. Even in these scripts, some letters are neither uppercase nor lowercase.
Scripts can also contain characters of any other general category, such as marks (diacritic and otherwise), numbers (numerals), punctuation, separators (word separators such as spaces), symbols and non-graphical format characters. These are included in a particular script when they are unique to that script. Other such characters are generally unified and included in the punctuation or diacritic blocks. However, the bulk of characters in any script (other than the common and inherited scripts) are letters.
Unicode defines 97 script names (called "Alias" or "Property value alias"), based on the ISO 15924 list, that are used in Unicode 6.0.[3] These 97 include 25 ancient or historic scripts, the generic Zyyy Common (Code for undetermined script) script name for characters that are used in multiple scripts, such as punctuation and symbols, and the general Zzzz Unknown (Code for uncoded script). Among the script codes not used are Zsym (Symbols) and Zmth (Mathematical notation). These are not considered to be scripts in the Unicode sense.
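As a brief illustration of non-letter characters that belong to a particular script, the following Python sketch looks up the general category of a few script-specific digits, punctuation marks, combining marks, separators and symbols. The `MAJOR_CLASS` mapping and the `major_class` helper are hypothetical conveniences; only `unicodedata.category` and `unicodedata.name` are standard library calls.

```python
import unicodedata

# Map the first letter of the two-letter general category to a readable class.
MAJOR_CLASS = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "other",
}


def major_class(ch):
    """Return the broad general-category class of a single character."""
    return MAJOR_CLASS[unicodedata.category(ch)[0]]


script_specific = [
    "\u0660",  # ARABIC-INDIC DIGIT ZERO
    "\u060c",  # ARABIC COMMA
    "\u094d",  # DEVANAGARI SIGN VIRAMA
    "\u1680",  # OGHAM SPACE MARK
    "\u0e3f",  # THAI CURRENCY SYMBOL BAHT
]

for ch in script_specific:
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  "
          f"{major_class(ch):12}  {unicodedata.name(ch)}")
```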
ISO 15924 script codes[a][b] and Unicode[c][d] | |||||||
---|---|---|---|---|---|---|---|
ISO 15924 | script in Unicode[e] | ||||||
Code | Nr | Name | Alias[f] | Direction | Version | Characters | Remark |
Afak | 439 | Afaka | Not in Unicode | ||||
Arab | 160 | Arabic | Arabic | R-to-L | 1.0 | 1,051 | |
Armi | 124 | Imperial Aramaic | Imperial Aramaic | R-to-L | 5.2 | 31 | Ancient/historic |
Armn | 230 | Armenian | Armenian | L-to-R | 1.0 | 90 | |
Avst | 134 | Avestan | Avestan | R-to-L | 5.2 | 61 | Ancient/historic |
Bali | 360 | Balinese | Balinese | L-to-R | 5.0 | 121 | |
Bamu | 435 | Bamum | Bamum | L-to-R | 5.2 | 657 | |
Bass | 259 | Bassa Vah | ? | (36) | Provisionally accepted for Unicode[g] | ||
Batk | 365 | Batak | Batak | L-to-R | 6.0 | 56 | |
Beng | 325 | Bengali | Bengali | L-to-R | 1.0 | 92 | |
Blis | 550 | Blissymbols | Not in Unicode | ||||
Bopo | 285 | Bopomofo | Bopomofo | L-to-R | 1.0 | 70 | |
Brah | 300 | Brahmi | Brahmi | L-to-R | 6.0 | 108 | Ancient/historic |
Brai | 570 | Braille | Braille | L-to-R | 3.0 | 256 | |
Bugi | 367 | Buginese | Buginese | L-to-R | 4.1 | 30 | |
Buhd | 372 | Buhid | Buhid | L-to-R | 3.2 | 20 | |
Cakm | 349 | Chakma | 6.1? | 67? | Included in beta release of Unicode 6.1.0[h] | ||
Cans | 440 | Unified Canadian Aboriginal Syllabics | Canadian Aboriginal | L-to-R | 3.0 | 710 | |
Cari | 201 | Carian | Carian | L-to-R | 5.1 | 49 | Ancient/historic |
Cham | 358 | Cham | Cham | L-to-R | 5.1 | 83 | |
Cher | 445 | Cherokee | Cherokee | L-to-R | 3.0 | 85 | |
Cirt | 291 | Cirth | Not in Unicode | ||||
Copt | 204 | Coptic | Coptic | L-to-R | 1.0 | 135 | (disunified from Greek in 4.1) Ancient/historic |
Cprt | 403 | Cypriot | Cypriot | R-to-L | 4.0 | 55 | Ancient/historic |
Cyrl | 220 | Cyrillic | Cyrillic | L-to-R | 1.0 | 408 | |
Cyrs | 221 | Cyrillic (Old Church Slavonic variant) | Not in Unicode | ||||
Deva | 315 | Devanagari (Nagari) | Devanagari | L-to-R | 1.0 | 150 | |
Dsrt | 250 | Deseret (Mormon) | Deseret | L-to-R | 3.1 | 80 | |
Dupl | 755 | Duployan shorthand, Duployan stenography | ? | (143) | Provisionally accepted for Unicode[g] | ||
Egyd | 070 | Egyptian demotic | Not in Unicode | ||||
Egyh | 060 | Egyptian hieratic | Not in Unicode | ||||
Egyp | 050 | Egyptian hieroglyphs | Egyptian Hieroglyphs | L-to-R | 5.2 | 1,071 | Ancient/historic |
Elba | 226 | Elbasan | ? | (40) | Provisionally accepted for Unicode[g] | ||
Ethi | 430 | Ethiopic (Geʻez) | Ethiopic | L-to-R | 3.0 | 495 | |
Geok | 241 | Khutsuri (Asomtavruli and Nuskhuri) | Not in Unicode | ||||
Geor | 240 | Georgian (Mkhedruli) | Georgian | L-to-R | 1.0 | 120 | |
Glag | 225 | Glagolitic | Glagolitic | L-to-R | 4.1 | 94 | Ancient/historic |
Goth | 206 | Gothic | Gothic | L-to-R | 3.1 | 27 | Ancient/historic |
Gran | 343 | Grantha | Not in Unicode | ||||
Grek | 200 | Greek | Greek | L-to-R | 1.0 | 511 | |
Gujr | 320 | Gujarati | Gujarati | L-to-R | 1.0 | 83 | |
Guru | 310 | Gurmukhi | Gurmukhi | L-to-R | 1.0 | 79 | |
Hang | 286 | Hangul (Hangŭl, Hangeul) | Hangul | L-to-R | 1.0 | 11,739 | Hangul syllables relocated in 2.0 |
Hani | 500 | Han (Hanzi, Kanji, Hanja) | Han | L-to-R | 1.0 | 75,960 | |
Hano | 371 | Hanunoo (Hanunóo) | Hanunoo | L-to-R | 3.2 | 21 | |
Hans | 501 | Han (Simplified variant) | Subset Hani | ||||
Hant | 502 | Han (Traditional variant) | Subset Hani | ||||
Hebr | 125 | Hebrew | Hebrew | R-to-L | 1.0 | 133 | |
Hira | 410 | Hiragana | Hiragana | L-to-R | 1.0 | 91 | |
Hluw | 080 | Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs) | Not in Unicode | ||||
Hmng | 450 | Pahawh Hmong | Not in Unicode | ||||
Hrkt | 412 | Japanese syllabaries (alias for Hiragana + Katakana) | Katakana or Hiragana | See Hira, Kana | |||
Hung | 176 | Old Hungarian | ? | (109) | Provisionally accepted for Unicode[g] | ||
Inds | 610 | Indus (Harappan) | Not in Unicode | ||||
Ital | 210 | Old Italic (Etruscan, Oscan, etc.) | Old Italic | L-to-R | 3.1 | 35 | Ancient/historic |
Java | 361 | Javanese | Javanese | L-to-R | 5.2 | 91 | |
Jpan | 413 | Japanese (alias for Han + Hiragana + Katakana) | See Hani, Hira and Kana | ||||
Jurc | 510 | Jurchen | Not in Unicode | ||||
Kali | 357 | Kayah Li | Kayah Li | L-to-R | 5.1 | 48 | |
Kana | 411 | Katakana | Katakana | L-to-R | 1.0 | 300 | |
Khar | 305 | Kharoshthi | Kharoshthi | R-to-L | 4.1 | 65 | Ancient/historic |
Khmr | 355 | Khmer | Khmer | L-to-R | 3.0 | 146 | |
Khoj | 322 | Khojki | Not in Unicode | ||||
Knda | 345 | Kannada | Kannada | L-to-R | 1.0 | 86 | |
Kore | 287 | Korean (alias for Hangul + Han) | See Hani and Hang | ||||
Kpel | 436 | Kpelle | Not in Unicode | ||||
Kthi | 317 | Kaithi | Kaithi | L-to-R | 5.2 | 66 | Ancient/historic |
Lana | 351 | Tai Tham (Lanna) | Tai Tham | L-to-R | 5.2 | 127 | |
Laoo | 356 | Lao | Lao | L-to-R | 1.0 | 65 | |
Latf | 217 | Latin (Fraktur variant) | L-to-R | typographic variant of Latin | |||
Latg | 216 | Latin (Gaelic variant) | L-to-R | typographic variant of Latin | |||
Latn | 215 | Latin | Latin | L-to-R | 1.0 | 1,267 | |
Lepc | 335 | Lepcha (Róng) | Lepcha | L-to-R | 5.1 | 74 | |
Limb | 336 | Limbu | Limbu | L-to-R | 4.0 | 66 | |
Lina | 400 | Linear A | ? | (341) | Provisionally accepted for Unicode[g] | ||
Linb | 401 | Linear B | Linear B | L-to-R | 4.0 | 211 | Ancient/historic |
Lisu | 399 | Lisu (Fraser) | Lisu | L-to-R | 5.2 | 48 | |
Loma | 437 | Loma | Not in Unicode | ||||
Lyci | 202 | Lycian | Lycian | L-to-R | 5.1 | 29 | Ancient/historic |
Lydi | 116 | Lydian | Lydian | R-to-L | 5.1 | 27 | Ancient/historic |
Mand | 140 | Mandaic, Mandaean | Mandaic | R-to-L | 6.0 | 29 | |
Mani | 139 | Manichaean | ? | (51) | Provisionally accepted for Unicode[g] | ||
Maya | 090 | Mayan hieroglyphs | Not in Unicode | ||||
Mend | 438 | Mende | Not in Unicode | ||||
Merc | 101 | Meroitic Cursive | 6.1? | 26? | Included in beta release of Unicode 6.1.0[h] | ||
Mero | 100 | Meroitic Hieroglyphs | 6.1? | 32? | Included in beta release of Unicode 6.1.0[h] | ||
Mlym | 347 | Malayalam | Malayalam | L-to-R | 1.0 | 98 | |
Mong | 145 | Mongolian | Mongolian | L-to-R | 3.0 | 153 | Includes Clear, Manchu scripts |
Moon | 218 | Moon (Moon code, Moon script, Moon type) | Not in Unicode | ||||
Mroo | 199 | Mro, Mru | ? | (43) | Provisionally accepted for Unicode[g] | ||
Mtei | 337 | Meitei Mayek (Meithei, Meetei) | Meetei Mayek | L-to-R | 5.2 | 56 | |
Mymr | 350 | Myanmar (Burmese) | Myanmar | L-to-R | 3.0 | 188 | |
Narb | 106 | Old North Arabian (Ancient North Arabian) | ? | (32) | Provisionally accepted for Unicode[g] | ||
Nbat | 159 | Nabataean | ? | (40) | Provisionally accepted for Unicode[g] | ||
Nkgb | 420 | Nakhi Geba ('Na-'Khi ²Ggŏ-¹baw, Naxi Geba) | Not in Unicode | ||||
Nkoo | 165 | N’Ko | N'Ko | R-to-L | 5.0 | 59 | |
Nshu | 499 | Nüshu | ? | (389) | Provisionally accepted for Unicode[g] | ||
Ogam | 212 | Ogham | Ogham | L-to-R | 3.0 | 29 | Ancient/historic |
Olck | 261 | Ol Chiki (Ol Cemet’, Ol, Santali) | Ol Chiki | L-to-R | 5.1 | 48 | |
Orkh | 175 | Old Turkic, Orkhon Runic | Old Turkic | R-to-L | 5.2 | 73 | Ancient/historic |
Orya | 327 | Oriya | Oriya | L-to-R | 1.0 | 90 | |
Osma | 260 | Osmanya | Osmanya | L-to-R | 4.0 | 40 | |
Palm | 126 | Palmyrene | ? | (32) | Provisionally accepted for Unicode[g] | ||
Perm | 227 | Old Permic | Not in Unicode | ||||
Phag | 331 | Phags-pa | Phags-pa | L-to-R | 5.0 | 56 | Ancient/historic |
Phli | 131 | Inscriptional Pahlavi | Inscriptional Pahlavi | R-to-L | 5.2 | 27 | Ancient/historic |
Phlp | 132 | Psalter Pahlavi | Not in Unicode | ||||
Phlv | 133 | Book Pahlavi | Not in Unicode | ||||
Phnx | 115 | Phoenician | Phoenician | R-to-L | 5.0 | 29 | Ancient/historic |
Plrd | 282 | Miao (Pollard) | 6.1? | 133? | Included in beta release of Unicode 6.1.0[h] | ||
Prti | 130 | Inscriptional Parthian | Inscriptional Parthian | R-to-L | 5.2 | 30 | Ancient/historic |
Qaaa | 900 | Reserved for private use (start) | Not in Unicode | ||||
Qaai | 908 | (Private use) | Inherited | 523 | In versions prior to 5.2 (from 5.2: 'Zinh') | ||
Qabx | 949 | Reserved for private use (end) | Not in Unicode | ||||
Rjng | 363 | Rejang (Redjang, Kaganga) | Rejang | L-to-R | 5.1 | 37 | |
Roro | 620 | Rongorongo | Not in Unicode | ||||
Runr | 211 | Runic | Runic | L-to-R | 3.0 | 78 | Ancient/historic |
Samr | 123 | Samaritan | Samaritan | R-to-L | 5.2 | 61 | |
Sara | 292 | Sarati | Not in Unicode | ||||
Sarb | 105 | Old South Arabian | Old South Arabian | R-to-L | 5.2 | 32 | Ancient/historic |
Saur | 344 | Saurashtra | Saurashtra | L-to-R | 5.1 | 81 | |
Sgnw | 095 | SignWriting | Not in Unicode | ||||
Shaw | 281 | Shavian (Shaw) | Shavian | L-to-R | 4.0 | 48 | |
Shrd | 319 | Sharada, Śāradā | 6.1? | 83? | Included in beta release of Unicode 6.1.0[h] | ||
Sind | 318 | Khudawadi, Sindhi | Not in Unicode | ||||
Sinh | 348 | Sinhala | Sinhala | L-to-R | 3.0 | 80 | |
Sora | 398 | Sora Sompeng | 6.1? | 35? | Included in beta release of Unicode 6.1.0[h] | ||
Sund | 362 | Sundanese | Sundanese | L-to-R | 5.1 | 55 | |
Sylo | 316 | Syloti Nagri | Syloti Nagri | L-to-R | 4.1 | 44 | |
Syrc | 135 | Syriac | Syriac | R-to-L | 3.0 | 77 | |
Syre | 138 | Syriac (Estrangelo variant) | Not in Unicode | ||||
Syrj | 137 | Syriac (Western variant) | Not in Unicode | ||||
Syrn | 136 | Syriac (Eastern variant) | Not in Unicode | ||||
Tagb | 373 | Tagbanwa | Tagbanwa | L-to-R | 3.2 | 18 | |
Takr | 321 | Takri, Ṭākrī, Ṭāṅkrī | 6.1? | 66? | Included in beta release of Unicode 6.1.0[h] | ||
Tale | 353 | Tai Le | Tai Le | L-to-R | 4.0 | 35 | |
Talu | 354 | New Tai Lue | New Tai Lue | L-to-R | 4.1 | 83 | |
Taml | 346 | Tamil | Tamil | L-to-R | 1.0 | 72 | |
Tang | 520 | Tangut | ? | (5,910) | Provisionally accepted for Unicode[g] | ||
Tavt | 359 | Tai Viet | Tai Viet | L-to-R | 5.2 | 72 | |
Telu | 340 | Telugu | Telugu | L-to-R | 1.0 | 93 | |
Teng | 290 | Tengwar | Not in Unicode | ||||
Tfng | 120 | Tifinagh (Berber) | Tifinagh | L-to-R | 4.1 | 57 | |
Tglg | 370 | Tagalog (Baybayin, Alibata) | Tagalog | L-to-R | 3.2 | 20 | |
Thaa | 170 | Thaana | Thaana | R-to-L | 3.0 | 50 | |
Thai | 352 | Thai | Thai | L-to-R | 1.0 | 86 | |
Tibt | 330 | Tibetan | Tibetan | L-to-R | 1.0 | 207 | (removed in 1.1 and reintroduced in 2.0) |
Tirh | 326 | Tirhuta | Not in Unicode | ||||
Ugar | 040 | Ugaritic | Ugaritic | L-to-R | 4.0 | 31 | Ancient/historic |
Vaii | 470 | Vai | Vai | L-to-R | 5.1 | 300 | |
Visp | 280 | Visible Speech | Not in Unicode | ||||
Wara | 262 | Warang Citi (Varang Kshiti) | Not in Unicode | ||||
Wole | 480 | Woleai | Not in Unicode | ||||
Xpeo | 030 | Old Persian | Old Persian | L-to-R | 4.1 | 50 | Ancient/historic |
Xsux | 020 | Cuneiform, Sumero-Akkadian | Cuneiform | L-to-R | 5.0 | 982 | Ancient/historic |
Yiii | 460 | Yi | Yi | L-to-R | 3.0 | 1,220 | |
Zinh | 994 | Code for inherited script | Inherited | In version 5.2 (prior versions: 'Qaai') | |||
Zmth | 995 | Mathematical notation | Not a 'script' in Unicode | ||||
Zsym | 996 | Symbols | Not a 'script' in Unicode | ||||
Zxxx | 997 | Code for unwritten documents | Not in Unicode | ||||
Zyyy | 998 | Code for undetermined script | Common | 6,379 | |||
Zzzz | 999 | Code for uncoded script | Unknown | all other code points | |||
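Fully populated rows of the table above lend themselves to a simple lookup structure keyed by ISO 15924 code. The following Python sketch parses four rows copied from the table. The `ScriptRow` type, the `parse_row` and `is_right_to_left` helpers, and the `TABLE` dictionary are hypothetical names introduced for illustration, and the parser assumes rows in which all of the first seven cells are present.

```python
from typing import NamedTuple


class ScriptRow(NamedTuple):
    code: str         # ISO 15924 four-letter code
    number: int       # ISO 15924 numeric code
    alias: str        # Unicode property value alias
    direction: str    # "L-to-R" or "R-to-L"
    version: str      # Unicode version in which the script was added
    characters: int   # number of characters assigned to the script


# Four complete rows copied from the table above.
ROWS = """\
Arab | 160 | Arabic | Arabic | R-to-L | 1.0 | 1,051 | |
Armn | 230 | Armenian | Armenian | L-to-R | 1.0 | 90 | |
Hebr | 125 | Hebrew | Hebrew | R-to-L | 1.0 | 133 | |
Latn | 215 | Latin | Latin | L-to-R | 1.0 | 1,267 | |
"""


def parse_row(line):
    """Parse one fully populated pipe-separated row into a ScriptRow."""
    cells = [cell.strip() for cell in line.split("|")]
    code, number, _name, alias, direction, version, characters = cells[:7]
    return ScriptRow(code, int(number), alias, direction, version,
                     int(characters.replace(",", "")))


TABLE = {row.code: row for row in map(parse_row, ROWS.splitlines())}


def is_right_to_left(code):
    """Return True if the script with this ISO 15924 code runs right to left."""
    return TABLE[code].direction == "R-to-L"


print(TABLE["Latn"].characters)
print(is_right_to_left("Hebr"), is_right_to_left("Armn"))
```

Rows with collapsed cells, such as those marked "Not in Unicode", would need separate handling that this sketch does not attempt.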